25 research outputs found

    Dynamic automatic differentiation of GPU broadcast kernels

    We show how forward-mode automatic differentiation (AD) can be employed within larger reverse-mode computations to dynamically differentiate broadcast operations in a GPU-friendly manner. Our technique fully exploits the broadcast Jacobian's inherent sparsity structure, and unlike a pure reverse-mode approach, this "mixed-mode" approach does not require a backwards pass over the broadcasted operation's subgraph, obviating the need for several reverse-mode-specific programmability restrictions on user-authored broadcast operations. Most notably, this approach allows broadcast fusion in primal code despite the presence of data-dependent control flow. We discuss an experiment in which a Julia implementation of our technique outperformed pure reverse-mode TensorFlow and Julia implementations for differentiating through broadcast operations within an HM-LSTM cell update calculation.
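The core idea can be sketched in plain Julia: during the forward (primal) pass, the broadcasted function is evaluated on dual numbers, capturing each elementwise partial without replaying the function's subgraph; since the broadcast Jacobian is diagonal, the reverse pass then only scales the incoming cotangent by the stored partials. The `Dual` type and function names below are a minimal illustration under those assumptions, not the paper's actual implementation.

```julia
# Minimal forward-mode dual number: a value paired with one partial derivative.
struct Dual
    val::Float64
    der::Float64
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:*(a::Real, b::Dual) = Dual(a * b.val, a * b.der)
Base.:>(a::Dual, b::Real) = a.val > b            # comparisons act on the value
Base.tanh(d::Dual) = Dual(tanh(d.val), (1 - tanh(d.val)^2) * d.der)

# "Mixed-mode" broadcast: run f elementwise on duals during the primal pass,
# keeping both value and partial; f may contain data-dependent control flow.
function dual_broadcast(f, x::Vector{Float64})
    ys = f.(Dual.(x, 1.0))                        # one fused forward sweep
    map(d -> d.val, ys), map(d -> d.der, ys)
end

# The reverse pass over the broadcast is then a pointwise scale by the stored
# partials -- the broadcast Jacobian is diagonal (sparse), so no backwards
# pass over f's subgraph is needed.
pullback(partials, cotangent) = partials .* cotangent
```

Note that the broadcasted function may branch on its input (e.g. `x -> x > 0 ? tanh(x) : 0.5 * x`): the dual-number sweep simply follows whichever branch the primal value takes, which is exactly why fusion survives data-dependent control flow here.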

    Automated Translation and Accelerated Solving of Differential Equations on Multiple GPU Platforms

    We demonstrate a high-performance vendor-agnostic method for massively parallel solving of ensembles of ordinary differential equations (ODEs) and stochastic differential equations (SDEs) on GPUs. The method is integrated with a widely used differential equation solver library in a high-level language (Julia's DifferentialEquations.jl) and enables GPU acceleration without requiring code changes by the user. Our approach achieves state-of-the-art performance compared to hand-optimized CUDA-C++ kernels, while performing 20-100× faster than the vectorized-map (vmap) approach implemented in JAX and PyTorch. Performance evaluation on NVIDIA, AMD, Intel, and Apple GPUs demonstrates performance portability and vendor-agnosticism. We show composability with MPI to enable distributed multi-GPU workflows. The implemented solvers are fully featured, supporting event handling, automatic differentiation, and incorporation of datasets via the GPU's texture memory, allowing scientists to take advantage of GPU acceleration on all major current architectures without changing their model code and without loss of performance.
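The parallelization pattern being GPU-accelerated here is "one kernel thread per trajectory": each ensemble member integrates independently with only scalar, allocation-free state, so the same integrator body can run as a GPU kernel. The following is a CPU sketch of that pattern under a fixed-step RK4 scheme; the names are illustrative and are not the paper's actual API (which builds on DifferentialEquations.jl).

```julia
# One independent trajectory: fixed-step RK4 using only scalars (no heap
# allocation), so the same body is suitable for execution as a GPU kernel.
function rk4_trajectory(f, u0, p, t0, t1, dt)
    u, t = u0, t0
    while t < t1 - 1e-12
        h  = min(dt, t1 - t)          # clamp the final step to hit t1 exactly
        k1 = f(u, p, t)
        k2 = f(u + h/2 * k1, p, t + h/2)
        k3 = f(u + h/2 * k2, p, t + h/2)
        k4 = f(u + h * k3, p, t + h)
        u += h/6 * (k1 + 2k2 + 2k3 + k4)
        t += h
    end
    u
end

# Ensemble over parameters: on a GPU this loop becomes the parallel grid,
# with one thread per parameter set.
ensemble_solve(f, u0, ps, t0, t1, dt) =
    [rk4_trajectory(f, u0, p, t0, t1, dt) for p in ps]

# Example model: linear decay du/dt = -p*u, exact solution u0 * exp(-p*t).
decay(u, p, t) = -p * u
```

Because trajectories share no state, the ensemble loop is embarrassingly parallel, which is what makes the vendor-agnostic GPU mapping (and the MPI composition mentioned above) straightforward.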

    Abstractions for programming graphics processors in high-level programming languages


    Metal.jl

    Metal v0.5.1

    Diff since v0.5.0

    Merged pull requests:
    - MPSMatrix improvements (#157) (@tgymnich)
    - Update manifest (#221) (@github-actions[bot])
    - Update manifest (#222) (@github-actions[bot])
    - Update manifest (#224) (@github-actions[bot])
    - Update manifest (#227) (@github-actions[bot])
    - CompatHelper: bump compat for ObjectiveC to 1, (keep existing compat) (#228) (@github-actions[bot])
    - Update manifest (#230) (@github-actions[bot])
    - Fix argument types in sincos (#232) (@fjebaker)
    - Update manifest (#233) (@github-actions[bot])
    - Improve docs (#235) (@christiangnrd)
    - Remove linear algebra section of MPS docs (#237) (@christiangnrd)
    - CompatHelper: bump compat for GPUCompiler to 0.22, (keep existing compat) (#238) (@github-actions[bot])
    - Port openlibm log1pf as log1p (#239) (@sotlampr)
    - Port openlibm erf (#240) (@tgymnich)
    - Remove 1.6-era override mechanism. (#241) (@maleadt)
    - CompatHelper: add new compat entry for Requires at version 1, (keep existing compat) (#242) (@github-actions[bot])
    - Update manifest (#243) (@github-actions[bot])
    - enable dependabot for GitHub actions (#244) (@ranocha)
    - Bump actions/checkout from 2 to 3 (#245) (@dependabot[bot])
    - Bump peter-evans/create-pull-request from 3 to 5 (#246) (@dependabot[bot])
    - Show METAL_CAPTURE_ENABLED in Metal.versioninfo() when the environment variable is set (#248) (@christiangnrd)
    - Update manifest (#249) (@github-actions[bot])
    - Adapt to GPUCompiler.jl, and other small updates. (#250) (@maleadt)
    - Switch to GPUArrays buffer management. (#251) (@maleadt)
    - Update manifest (#252) (@github-actions[bot])
    - Update manifest (#253) (@github-actions[bot])
    - Bump GPUCompiler (#255) (@maleadt)

    Closed issues:
    - Random access indexing into MtlArray views cause scalar indexing (#149)
    - Q: How to debug kernels - KA.@print? (#223)
    - Crash during MTLDispatchListApply (#225)
    - Unable to compile trig functions through ForwardDiff (#229)
    - symbol multiply defined! Bug/crash on Julia master, fine on 1.10 (#231)
    - log1p fails on MtlArray{Float32} (#234)
    - When precompiling, UndefVarError: CompilerConfig not defined (#247)


    Flexible Performant GEMM Kernels on GPUs

    General Matrix Multiplication or GEMM kernels take centre stage in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA's Tensor Cores. Their exploitation is hampered by the two-language problem: it requires either low-level programming, which implies low programmer productivity, or using libraries that only offer a limited set of components. Because rephrasing algorithms in terms of established components often introduces overhead, the libraries' lack of flexibility limits the freedom to explore new algorithms. Researchers using GEMMs can hence not enjoy programming productivity, high performance, and research flexibility at once. In this paper we solve this problem. We present three sets of abstractions and interfaces to program GEMMs within the scientific Julia programming language. The interfaces and abstractions are co-designed for researchers' needs and Julia's features to achieve sufficient separation of concerns and flexibility to easily extend basic GEMMs in many different ways without paying a performance price. Comparing our GEMMs to state-of-the-art libraries cuBLAS and CUTLASS, we demonstrate that our performance is in the same ballpark as the libraries, and in some cases even exceeds it, without having to write a single line of code in CUDA C++ or assembly, and without facing flexibility limitations.
    Comment: This paper was submitted to IEEE TPD
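The "separation of concerns" the abstract refers to can be illustrated in plain Julia: the blocking structure of the kernel is written once, while variants (bias add, activations, different element types) plug in as a fused epilogue instead of requiring a rewritten kernel. The sketch below is illustrative only, under that assumption; it is not the paper's actual interface, and a real GPU version would map the tile loops onto thread blocks and Tensor Core fragments.

```julia
# Blocked (tiled) GEMM skeleton: the outer tiling is separated from the
# per-element epilogue, so extensions compose without touching the kernel.
function blocked_gemm!(C, A, B; tile = 4, epilogue = identity)
    M, K = size(A)
    N = size(B, 2)
    fill!(C, zero(eltype(C)))
    for i0 in 1:tile:M, j0 in 1:tile:N, k0 in 1:tile:K   # tile loops
        for i in i0:min(i0 + tile - 1, M), j in j0:min(j0 + tile - 1, N)
            acc = zero(eltype(C))
            for k in k0:min(k0 + tile - 1, K)            # inner product tile
                acc += A[i, k] * B[k, j]
            end
            C[i, j] += acc
        end
    end
    map!(epilogue, C, C)    # fused epilogue, e.g. an activation function
    C
end
```

Passing `epilogue = x -> max(x, zero(x))` fuses a ReLU into the same kernel skeleton, which is the kind of extension the flexibility argument is about.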

    Effective extensible programming: unleashing Julia on GPUs

    GPUs and other accelerators are popular devices for accelerating compute-intensive, parallelizable applications. However, programming these devices is a difficult task. Writing efficient device code is challenging, and is typically done in a low-level programming language. High-level languages are rarely supported, or do not integrate with the rest of the high-level language ecosystem. To overcome this, we propose compiler infrastructure to efficiently add support for new hardware or environments to an existing programming language. We evaluate our approach by adding support for NVIDIA GPUs to the Julia programming language. By integrating with the existing compiler, we significantly lower the cost to implement and maintain the new compiler, and facilitate reuse of existing application code. Moreover, use of the high-level Julia programming language enables new and dynamic approaches for GPU programming. This greatly improves programmer productivity, while maintaining application performance similar to that of the official NVIDIA CUDA toolkit.
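The practical upshot of this line of work is that device kernels are written in ordinary Julia rather than CUDA C. The sketch below shows the SPMD style: a kernel body parameterized over a thread index, with a trivial sequential driver standing in for a GPU launch. The driver name is illustrative and not part of CUDA.jl; with CUDA.jl, a kernel of this shape (taking its index from `threadIdx()`) would be launched over GPU threads via the `@cuda` macro instead.

```julia
# SPMD-style kernel body written in plain Julia: each "thread" i updates
# one element. The same high-level code style is what the compiler
# infrastructure described above compiles to native GPU code.
function saxpy_kernel!(y, a, x, i)
    @inbounds y[i] = a * x[i] + y[i]
    nothing
end

# Sequential stand-in for a GPU launch (illustrative; on a GPU, CUDA.jl's
# @cuda would run the kernel body across threads in parallel).
function launch_cpu!(kernel!, n, args...)
    for i in 1:n
        kernel!(args..., i)
    end
end
```

Because the kernel is ordinary Julia, the same source can be exercised on the CPU for testing and then compiled for the device, which is one of the productivity gains the abstract claims.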